Ford GoBike is a regional public bicycle sharing system in the San Francisco Bay Area, California.
Ford GoBike, like other bike share systems, consists of a fleet of specially designed, sturdy and durable bikes that are locked into a network of docking stations throughout the city. The bikes can be unlocked from one station and returned to any other station in the system, making them ideal for one-way trips. The bikes are available for use 24 hours/day, 7 days/week, 365 days/year and riders have access to all bikes in the network when they become a member or purchase a pass.
This data set includes information about individual rides made in the Ford GoBike bike-sharing system, covering the greater San Francisco Bay Area.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
!pip install plotly==5.9.0 --quiet
import plotly.express as px
%matplotlib inline
sns.set_style('darkgrid')
plt.rcParams['font.size'] = 12
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['figure.facecolor'] = '#00000000'
#loading data
rides_df = pd.read_csv('201902-fordgobike-tripdata.csv')
rides_df.head()
rides_df.shape
rides_df.info()
rides_df.isna().sum().sort_values()
rides_df.duplicated().sum()
rides_df.user_type.unique()
rides_df.member_gender.unique()
rides_df.bike_share_for_all_trip.unique()
Based on the assessment above, we drop rows with missing values and change the data types of several columns.
rides_df.dropna(inplace=True)
#convert to string
rides_df[['start_station_id', 'end_station_id', 'bike_id']] = rides_df[['start_station_id', 'end_station_id', 'bike_id']].astype(str)
rides_df['member_birth_year'] = rides_df['member_birth_year'].astype(int)
rides_df[['start_time', 'end_time']] = rides_df[['start_time', 'end_time']].apply(pd.to_datetime)
rides_df.info()
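One detail of the string conversion above is worth checking. This is a minimal sketch on a toy frame with hypothetical values, assuming the ID column was read as a float (which pandas does when an integer column contains missing values). In that case a direct cast to `str` keeps the trailing `.0`, while casting through `int` first gives plain identifiers.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mimicking an ID column that pandas
# read as float because one row is missing.
toy = pd.DataFrame({"start_station_id": [21.0, 3.0, np.nan]})

# Drop the missing row first, as in the cleaning step above.
cleaned = toy.dropna().copy()

# Casting float -> str directly keeps the decimal part.
direct = cleaned["start_station_id"].astype(str).tolist()

# Casting float -> int -> str gives plain identifiers.
via_int = cleaned["start_station_id"].astype(int).astype(str).tolist()

print(direct)
print(via_int)
```

Whether this matters depends on how the IDs are used later; for grouping or counting, either form works as long as it is applied consistently.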
The dataset has 183,412 rows and 16 columns.
Questions of interest: the average age of riders, the average duration of trips, the gender distribution of riders, and the distribution of user types.
The relevant variables are birth year, duration, gender, and user type.
#create an age column
rides_df['age'] = 2019 - rides_df['member_birth_year']
rides_df.head(3)
rides_df['age'].describe()
The age column seems to have quite a number of outliers: 75% of the riders are younger than 39, yet the maximum age is 141, which is likely an error. Let us visualize it for a clearer picture.
plt.boxplot(rides_df['age'], vert=False)
plt.xlabel('Age')
plt.title('Distribution of Age');
As the boxplot above shows, there are quite a number of outliers, with many riders in the senior age range. This is plausible, because seniors are encouraged to cycle as a form of exercise, so we treat most of these outliers as legitimate data points. However, we doubt that anyone rode a bike at age 141, or even at age 119. There is no record of anyone being alive at age 141 as of 2019, and even if such a person existed, it would certainly be risky to let them ride a bike.
To deal with this, we set the age limit for this analysis to 100.
high = rides_df["age"] < 101
rides = rides_df[high]
rides.age.describe()
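For context on what the boxplot marks as an outlier, here is a minimal sketch of the usual 1.5 × IQR rule (the matplotlib default for whiskers), applied to a small hypothetical list of ages rather than the real column. It illustrates how points get flagged and is not a replacement for the fixed cutoff of 100 used above.

```python
import pandas as pd

# Hypothetical ages, including one implausible value.
ages = pd.Series([20, 25, 30, 35, 40, 141])

# Quartiles and interquartile range.
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Points above this limit are drawn as individual outliers in a boxplot.
upper_limit = q3 + 1.5 * iqr
flagged = ages[ages > upper_limit].tolist()

print(q1, q3, iqr, upper_limit)
print(flagged)
```

The IQR rule would also flag many legitimate senior riders in the real data, which is why a domain-based limit is a reasonable alternative here.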
rides['age'].hist(bins=20);
plt.xlabel('Age')
plt.ylabel('Count')
plt.title('Age Distribution');
From the histogram above, we can see that most riders are between 25 and 35 years old.
rides['duration_sec'].describe()
rides['duration_sec'].hist(bins=500)
plt.xlabel('Duration[Sec]')
plt.title('Distribution of Trip Duration');
The trip duration histogram is highly skewed because of a few very long trips. For this reason, we use the median to answer the remaining questions about duration: unlike the mean, the median is not strongly affected by outliers.
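A small sketch of why the median is the safer summary here, using five hypothetical trip durations (not taken from the dataset) where one trip is extremely long:

```python
import pandas as pd

# Hypothetical trip durations in seconds; the last one is an extreme trip.
durations = pd.Series([300, 400, 500, 600, 85000])

mean_sec = durations.mean()      # pulled far upward by the single long trip
median_sec = durations.median()  # stays at the middle value

print(mean_sec, median_sec)
```

A single long trip moves the mean far away from what a typical ride looks like, while the median is unchanged.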
fig = px.pie(rides, names='member_gender', width=600, height=300, title='Distribution of Gender')
fig.show()
A large percentage of the riders are male (74%), roughly three times the share of female riders (23%).
According to an investigation carried out by Elizabeth Plank on the bike paths of New York City, far more men ride bikes than women: “In the U.S., 1 woman for every 3 men gets around on a bicycle”.
According to Plank, “In London, 77% of bike trips are taken by men and only 5% of women identify as frequent cyclists.”
https://slate.com/human-interest/2014/09/gender-gap-alert-men-ride-bikes-way-more-than-women-do.html
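The gender shares read off the pie chart can also be computed directly. This is a minimal sketch on a hypothetical four-row column; on the real data the same call would be applied to `rides['member_gender']`.

```python
import pandas as pd

# Hypothetical gender column standing in for rides['member_gender'].
genders = pd.Series(["Male", "Male", "Male", "Female"])

# normalize=True returns fractions; scale to percentages and round.
shares = genders.value_counts(normalize=True).mul(100).round(1)

print(shares)
```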
# define a function to plot a categorical feature
def plot_cat(var, l=8, b=5):
    plt.figure(figsize=(l, b))
    # pass the column as x explicitly; recent seaborn versions require keywords
    sns.countplot(x=rides[var], order=rides[var].value_counts().index)
#call function to plot countplot
plot_cat('user_type')
Over 90% of the riders in our dataset are subscribers, meaning they pay a subscription fee, which could be monthly or yearly. Only a small percentage of the riders are customers, meaning they pay at the station or kiosk for each trip.
plot_cat('bike_share_for_all_trip')
A large percentage of riders are not part of the Bike Share for All program. We don't have enough information to know the reason for this.
rides.groupby('member_gender')['duration_sec'].median()
rides.groupby('member_gender')['duration_sec'].median().sort_values(ascending=False).plot(kind='bar')
plt.xlabel('Gender')
plt.ylabel('Duration[sec]')
plt.title('Median Duration of Trips by Gender');
Female riders go on longer trips (a median of 567 seconds, or approximately 9.5 minutes), though the difference from the median trip duration for male riders is not large.
rides.groupby('user_type')['duration_sec'].median()
rides.groupby('user_type')['duration_sec'].median().plot(kind='bar')
plt.xlabel('User Type')
plt.ylabel('Duration[Sec]')
plt.title('Median Duration of Trips by User Type');
Customers go on longer trips (a median of 780 seconds, or 13 minutes) than subscribers (490 seconds, or approximately 8 minutes).
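The conversion from seconds to minutes can be done on the grouped result rather than by hand. This is a minimal sketch that uses the approximate medians quoted above as stand-in values; on the real data the series would come from the `groupby('user_type')` call.

```python
import pandas as pd

# Approximate median durations in seconds, as quoted in the text above.
median_sec = pd.Series({"Customer": 780, "Subscriber": 490})

# Convert to minutes and round to one decimal place.
median_min = (median_sec / 60).round(1)

print(median_min)
```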
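# .dt.week was removed in recent pandas versions; isocalendar().week gives the same ISO week number
rides_df['ride_start_week'] = rides_df['start_time'].dt.isocalendar().week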
rides_df.groupby('ride_start_week')['duration_sec'].median()
rides_df.groupby('ride_start_week')['duration_sec'].median().sort_values().plot(kind='barh')
plt.xlabel('Duration[Sec]')
plt.ylabel('ISO Week Number')
plt.title('Median Duration of Trips per Week');
Riders took the longest trips in the fourth week of the month (ISO week 8). The median trip duration for that week was 532 seconds, or approximately 9 minutes, though this is not much different from the other weeks.
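The week numbers in the result above (such as 8) are ISO week-of-year numbers, not weeks of the month. This is a minimal sketch of the difference on three hypothetical February 2019 timestamps; the `week_of_month` formula is one simple convention (days 1 to 7 are week 1, and so on), not something defined in the original analysis.

```python
import pandas as pd

# Hypothetical ride start times within February 2019.
starts = pd.Series(pd.to_datetime(["2019-02-01", "2019-02-20", "2019-02-28"]))

# ISO week of the year (Monday-based weeks).
iso_week = starts.dt.isocalendar().week.astype(int).tolist()

# A simple week-of-month convention: days 1-7 -> 1, 8-14 -> 2, ...
week_of_month = ((starts.dt.day - 1) // 7 + 1).tolist()

print(iso_week)
print(week_of_month)
```

The two numberings do not line up exactly, because ISO weeks start on Monday while this week-of-month convention starts on the first day of the month.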
# Add a column for the weekday of the start of the ride
rides_df['ride_start_weekday'] = rides_df['start_time'].dt.day_name()
# Print the median trip time per weekday
print(rides_df.groupby('ride_start_weekday')['duration_sec'].median())
rides_df.groupby('ride_start_weekday')['duration_sec'].median().sort_values(ascending=False).plot(kind='bar')
plt.xlabel('Week Day')
plt.ylabel('Duration[Sec]')
plt.title('Median Duration of Trips by Day of the Week');
Riders went on longer trips on weekends (Saturdays and Sundays).
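The weekday comparison is easier to read when the days appear in calendar order rather than alphabetically or sorted by value. This is a minimal sketch on a hypothetical five-row frame that reuses the `ride_start_weekday` and `duration_sec` column names from above.

```python
import pandas as pd

# Hypothetical rides; column names follow the notebook.
toy = pd.DataFrame({
    "ride_start_weekday": ["Saturday", "Monday", "Saturday", "Monday", "Sunday"],
    "duration_sec": [900, 400, 700, 500, 800],
})

calendar_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday"]

# Median per weekday, reordered to calendar order; days with no rides
# in this toy data are dropped.
medians = (
    toy.groupby("ride_start_weekday")["duration_sec"]
    .median()
    .reindex(calendar_order)
    .dropna()
)

print(medians)
```

The resulting series can be passed to `.plot(kind='bar')` in the same way as the sorted version.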
fig = px.scatter(rides_df, x='age', y='duration_sec', title='Duration vs Age')
fig.show()
There is no clear linear relationship between age and trip duration. However, most of the people who took longer trips were between 25 and 45 years old.
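The visual impression of "no linear relationship" can be backed by a Pearson correlation coefficient. This is a minimal sketch on four hypothetical points, chosen so that durations peak in the middle ages; it shows that a coefficient near zero rules out a linear trend but not other patterns.

```python
import pandas as pd

# Hypothetical data: longer trips in the middle of the age range.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "duration_sec": [500, 700, 700, 500],
})

# Pearson correlation (the default method for Series.corr).
r = toy["age"].corr(toy["duration_sec"])

print(r)
```

On the real data the same call would be `rides['age'].corr(rides['duration_sec'])`, assuming the filtered `rides` frame is used so that implausible ages do not distort the result.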
fig = px.scatter_mapbox(
    rides_df,  # Our DataFrame
    lat='start_station_latitude',
    lon='start_station_longitude',
    center={"lat": 37.773972, "lon": -122.431297},  # Map will be centered on San Francisco
    width=600,  # Width of map
    height=600,  # Height of map
    hover_data=['start_station_name'],  # Display station name when hovering over a station
    title='Distribution of Stations'
)
fig.update_layout(mapbox_style="open-street-map")
fig.show()
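The map above draws one marker per ride, so each station is plotted many times. This is a minimal sketch, on a hypothetical three-row frame with the same column names, of reducing the data to one row per station before plotting. It is a suggestion rather than part of the original analysis.

```python
import pandas as pd

# Hypothetical rides: two start at the same station.
toy = pd.DataFrame({
    "start_station_name": ["Station A", "Station A", "Station B"],
    "start_station_latitude": [37.77, 37.77, 37.80],
    "start_station_longitude": [-122.42, -122.42, -122.40],
})

station_cols = ["start_station_name",
                "start_station_latitude",
                "start_station_longitude"]

# Keep one row per distinct station.
stations = toy.drop_duplicates(subset=station_cols).reset_index(drop=True)

print(stations)
```

Passing the de-duplicated frame to the map call instead of the full ride table should give the same picture with far fewer points to render.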